- The big picture: human languages evolve on a cultural timescale:
- individual utterance selection > language change > language evolution
- Massive centuries-spanning corpora compiled in recent years open up unprecedented avenues for investigating language dynamics.
- (cf. Bochkarev et al., 2014; Cuskley et al., 2014; Feltgen et al., 2017; Frermann and Lapata, 2016; Gulordava and Baroni, 2011; Hamilton et al., 2016; Newberry et al., 2017; Petersen et al., 2012; Sagi et al., 2011; Schlechtweg et al., 2017; Wijaya and Yeniterzi, 2011)
- These allow tracking not only word usage frequencies but also word meaning, via distributional semantics methods
- What I’m interested in: as new words - e.g. neologisms & borrowings - are selected for, what happens to their older synonyms?
- Identified two confounds that need to be controlled for
- Simply counting words can lead to spurious results
- a big change may well be driven by a change in topic composition (1)
- Automatic distribution-based similarity measures are useful for quantifying both meaning and meaning change
- but apparent semantics tend to change when frequency changes (2)
- Observation: the ebb and flow of discourse topics in a diachronic corpus reflects real-world events (wars -> war-related news -> frequency of military words increases)
- Token frequency ~ probability of usage ~ fitness ~ being selected for
- However: corpus frequencies may be misleading (Chesley & Baayen, 2010; Lijffijt et al., 2012; Calude et al., 2017; Szmrecsanyi, 2016)
- Observation: similar words sometimes both increase in frequency instead of competing; likewise, the emergence of a new word often coincides with a frequency increase, not decrease, in similar words.
- Frequency change might not necessarily imply selection.
- Topical advection: a measure of how much a word's topic/context words (e.g. mocha, for the target latte) have changed on average (weighted by an association score) between two periods.
- latte: calculate its log frequency change (e.g. +1.19 between 1990s->2000s)
- calculate its topical advection: +0.07 (weighted mean log frequency change in context words) (see Appendix for math)
- Works similarly in other diachronic databases of cumulative culture (e.g. movies, boardgames, cookbooks)
- A useful baseline to include in any model of diachronic frequency change of linguistic (or other cultural) elements.
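The +1.19 log frequency change for latte can be reproduced with hypothetical counts; the actual corpus frequencies behind that figure are not given here, so the numbers below are chosen purely to illustrate the arithmetic (the +1 smoothing follows the Appendix definition):

```python
import math

def log_change(f_now, f_prev):
    """Change in log frequency between two periods, with +1 smoothing
    to avoid log(0) (see Appendix)."""
    return math.log(f_now + 1) - math.log(f_prev + 1)

# Hypothetical counts: 300 occurrences in the 1990s, 990 in the 2000s.
print(round(log_change(990, 300), 2))  # 1.19
```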
- The two confounds that need to be controlled for…
- Topical fluctuations ✔
- Interplay of frequency change and semantic change measures
- Observation: frequency change in a word appears to affect its (distributional) semantics.
- Distributional semantics ~ topic modelling, all based on contextual co-occurrence; semantic change ~ semantic (self-)similarity of a word between temporal subcorpora
- If the frequency difference of a word between two time periods affects its semantics, this would be a problem for semantic change measures
(cf. Dubossarsky et al. 2017 for more critique of automated semantic change measures)
- Simulate the frequency change of a word between subcorpora and measure the resulting semantic change
- But instead of actual different subcorpora, use data from one single corpus (2000-2009 in COHA), and generate different versions of it (corpus\('\)) where the occurrences of some target word \(w\) have been downsampled by relabelling a fixed portion of them as \(w'\)
- Measure the similarity of \(w\) in the original corpus -> to \(w'\) in corpus\('\).
- Null hypothesis: no semantic change should occur (actually the same word)
- 100 random words (nouns) from equally spaced log frequency bands, 25 downsample sizes \(s \in [0.1, 7]\)
- For each \(w\) with frequency \(f\), and each \(s\), relabel a portion \(e^{\ln(f) - s} = f/e^s\) (excl. downsamples \(n<10\))
- E.g., if \(f=1000\), \(s=0.7\), then \(1000/e^{0.7} \approx 496\), or a \(-50.3\%\) reduction.
- For each downsampled \(w'\), measure its semantic similarity to the original word, using 5 different distributional approaches (with 10x replications for each combination):
- full count vectors (no dimension reduction), cosine similarity
- full vectors, but PPMI weighted, cosine similarity
- APSyn rank-based similarity, using top 100 PPMI-weighted terms (Santus et al. 2016)
- Latent Semantic Analysis (SVD) embeddings of count vectors, cosine similarity
- GloVe embeddings of count vectors, cosine similarity (Pennington et al. 2014)
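The first two methods can be sketched compactly. This is a minimal toy implementation of cosine similarity over (PPMI-weighted) co-occurrence count vectors, not the actual pipeline: the window size, vocabulary handling, and corpus preprocessing here are illustrative assumptions.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=2):
    """Symmetric word-by-word co-occurrence counts within a fixed window."""
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j and w in index and sent[j] in index:
                    M[index[w], index[sent[j]]] += 1
    return M

def ppmi(M):
    """Positive pointwise mutual information weighting of a count matrix."""
    total = M.sum()
    row = M.sum(axis=1, keepdims=True)
    col = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Zero counts get ratio 1, i.e. log(1) = 0 after the log.
        ratio = np.where(M > 0, M * total / (row * col), 1.0)
    return np.maximum(np.log(ratio), 0.0)

def cosine(u, v):
    """Cosine similarity between two row vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

With a toy corpus where latte and mocha share all their contexts, `cosine` over the PPMI rows returns 1.0 for that pair and 0.0 for unrelated pairs such as latte vs. tea.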
- All 5 semantic similarity methods exhibit the bias, but the extent is variable; the bias is more predictable by frequency band in some methods, less in others
- Vector space density matters: a large change value does not necessarily correspond to a categorical change in semantics in a sparse space; but similarity rank between \(w\) and \(w'\) is comparable between methods
- Good news: change to the extent of becoming a “different word” (\(w\) not the closest synonym for \(w'\)) occurs mostly at low frequencies (<100), which should be considered unreliable anyway
- Some methods (APSyn, GloVe) are more susceptible, while the very simple method of measuring cosine similarity over an unreduced PPMI-weighted vector space performs best.
- The downsampling approach is extendable to actual diachronic corpora, to compare observed semantic change against the change expected from frequency difference alone.
- As new words are selected for, what happens to their older synonyms?
- Observation: competition may manifest in at least two ways:
- the losing variant decreases in usage frequency
- or it changes meaning while the form remains in use (e.g. radio <-> wireless, beef <-> cow)
- …but at times near-synonyms both successfully remain in use
- Hypothesis: high semantic similarity (introduced by emergent novel words or semantic change) leads to competition between similar variants¹, unless there is sufficient communicative need² in the lexical subspace to sustain near-synonymy.
- ¹ apparent as diverging frequency or diverging semantics, where the semantic change measure must be controlled for the frequency bias
- ² as measured by the advection model
- All the code, the slides with interactive plots, link to full paper: https://andreskarjus.github.io
The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in frequencies (compared to the previous period) of those associated words. More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is
\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,\dots,m \}, \, W \big) \end{equation}\]where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply
\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum x_i w_i }{\sum w_i} \end{equation}\]where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,
\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.
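The definitions above translate directly into code. A minimal sketch, assuming the association weights \(W\) (e.g. PPMI scores) and the per-period frequencies of the associated words are already available:

```python
import math

def log_change(freq_now, freq_prev):
    """logChange: difference of log frequencies, with +1 smoothing
    to avoid log(0)."""
    return math.log(freq_now + 1) - math.log(freq_prev + 1)

def advection(context_freqs_now, context_freqs_prev, weights):
    """Topical advection: weighted mean of the log frequency changes of a
    target word's associated context words.

    context_freqs_now/prev: frequencies of each associated word at t and t-1.
    weights: association scores W, aligned with the frequency lists.
    """
    changes = [log_change(fn, fp)
               for fn, fp in zip(context_freqs_now, context_freqs_prev)]
    return sum(c * w for c, w in zip(changes, weights)) / sum(weights)
```

If every context word keeps its frequency, the advection value is 0; if all context words change by the same log amount, the weighted mean equals that amount regardless of the weights.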
For each \(w\) with an original frequency \(f\), and each \(s\): downsample by randomly relabeling a fixed portion of its occurrences as \(w'\) in the corpus, where the portion is defined as \(e^{\ln(f) - s} = f/e^s\) (excluding downsamples with \(<10\) occurrences)
E.g., if \(f=1000\), \(s=0.7\), then \(e^{\ln(1000) - 0.7} \approx 496\), or a \(-50.3\%\) reduction.
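The downsampling procedure can be sketched as follows. This is an illustrative implementation on a token list, not the actual corpus code; the relabeled token name (`w + "_PRIME"`) and the truncation of the downsample size to an integer are assumptions.

```python
import math
import random

def downsample_size(f, s):
    """Number of occurrences to relabel: e^(ln f - s) = f / e^s,
    truncated to an integer."""
    return int(f / math.exp(s))

def relabel(tokens, w, s, rng=random):
    """Relabel a random portion of w's occurrences as w' (here: w + '_PRIME').

    Returns None if the downsample would have fewer than 10 occurrences,
    matching the exclusion criterion in the text.
    """
    positions = [i for i, t in enumerate(tokens) if t == w]
    n = downsample_size(len(positions), s)
    if n < 10:
        return None
    out = list(tokens)
    for i in rng.sample(positions, n):
        out[i] = w + "_PRIME"
    return out
```

For \(f=1000\) and \(s=0.7\), `downsample_size` returns 496, matching the worked example above.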
*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.